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Abstract 

Given n noisy observations §i of the same quantity /, it is common 
use to give an estimate of / by minimizing the function Y^i=\{9i ~ f) 2 - 
From a statistical point of view this corresponds to computing the 
Maximum Likelihood estimate, under the assumption of Gaussian 
noise. However, it is well known that this choice leads to results 
that are very sensitive to the presence of outliers in the data. For 
this reason it has been proposed to minimize functions of the form 
Yli=i V(.9i ~ f)i where V is a function that increases less rapidly than 
the square. Several choices for V have been proposed and success- 
fully used to obtain "robust" estimates. In this paper we show that, 
for a class of functions V, using these robust estimators corresponds 
to assuming that data are corrupted by Gaussian noise whose vari- 
ance fluctuates according to some given probability distribution, that 
uniquely determines the shape of V. 

© Massachusetts Institute of Technology, 1996 

This paper describes research done within the Center for Biological Information Process- 
ing, in the Department of Brain and Cognitive Sciences, and at the Artificial Intelligence 
Laboratory. This research is sponsored by a grant from the Office of Naval Research 
(ONR), Cognitive and Neural Sciences Division; by the Artificial Intelligence Center of 
Hughes Aircraft Corporation (Sl-801534-2). Support for the A. I. Laboratory's artificial 
intelligence research is provided by the Advanced Research Projects Agency of the Depart- 
ment of Defense under Army contract DACA76-85-C-0010, and in part by ONR contract 
N00014-85-K-0124. 



1 Introduction 

A common problem in statistics is the following: given n noisy observa- 
tions §i of the same quantity /, give an estimate of /. A typical solution 
to this problem consists in choosing the value of / that maximes the like- 
lihood function P(g\f), that is the probability of having observed the data 
g = (<7i, . . . ,g n ) if the true value was /. Estimates of this type are named 
Maximum Likelihood (ML) estimates, and rely on the assumption that we 
know the likelihood function P(g\ } . . . } g n \f) } that is essentially a model of 
how noise affected the measure process. 

A common assumption is that of additive Gaussian noise, in which we 
assume that the measurement gi are related to the true value by the relation 

ft = / + ev , i = l,...,ra , 

where e 8 - are independent random variables with given gaussian probability 
distributions P 8 (e 8 ) of variance <7 8 - and zero mean. In this case the likelihood 
function is 



p(^i,...,^i/) = n^( e o = n\/- e 



& --Pi(9i-f) 2 



8=1 8=1 " 

where /3 8 - = ^j- Maximizing the likelihood function (1) corresponds therefore 

i 

to solve the following minimization problem: 

N 

min £#G7.--/) 2 . (2) 

' 8=1 

An elementary computation shows that the solution is the weighted av- 
erage of the data: 

f EF=i/%8 



£?=i A 

The ML estimate has therefore a simple meaning and it is easy to com- 
pute. However, it is well known that estimates of this type are not "robust", 
that is are they very sensitive to the presence of outliers in the data. In order 
to overcome this difficulty it has been proposed to use a modified version of 
the minimization problem (2): 

N 

mmE%-/), (3) 

1 8 = 1 
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Figure 1: Different choices for the function V . 
x < 0.5 and then constant. (2) V(x) = \x\. (3) V(x 
and then linear, with continuous hrst derivative. 



f) V(x) is quadratic for 
is quadratic for x < 0.5 



where the quadratic function (g 8 - — f) 2 has been substituted by some other 
less rapidly increasing even function V. Estimator of this type are known 
in statistics, for particular choices of V, as robust estimators (Huber, 1981). 
The idea underlying (3) is that if the error (g 8 - — f) 2 is large, it is likely that §{ 
is an outlier, so that we do not want to enforce / to be close to it. Therefore 
the function V should not increase much after a certain value. Different 
shapes for V have been proposed, and some of them have been depicted in 
hgure (1). 

In this paper we want to give a more rigorous justification for the use of es- 
timates like the one of eq. (3), and also to give an interpretation of the model 
of noise to which they correspond. We will see that if the function e~ v ^ x ' 
is completely monotone, then using eq. (3) corresponds to assuming that 
our measures are affected by a Gaussian noise whose variance is a random 
variable with given probability distribution. Depending on the probability 
distribution of the variance of the noise, different shapes for V are obtained. 
For a particular choice of V a justification of such a technique was given in 
(Girosi, Poggio and Caprile, 1991), but no characterization was given. In the 
next section we formalize these statements, while in the following sections 
we present a large class of functions V that can be used, together with some 



examples. 

2 Robust Maximum Likelihood Estimates 

In order to simplify the notation we consider the problem presented in the 
previous section in which only one measurement g is done, since this does 
not change the main conclusions. We therefore assume that 

g = f + £ (4) 

where e is a random variable whose distribution is Gaussian with zero mean 
and variance a. The likelihood function is therefore 



P{g\f) = \l'-e- p(9 - fr (5) 

where f3 = -\. 

When we compute the standard maximum likelihood estimate we are as- 
suming that the variance of the noise has a fixed value, but this assumption 
is not always realistic. In fact, in many cases the accuracy of the measure- 
ment apparatus can fluctuate, due to some external causes, and in these 
cases our data can contain outliers. A more realistic assumption consists in 
considering the variance of the Gaussian noise, and therefore /3, as a random 
variable, with given distribution P(f3). We are therefore led to introduce the 
probability P(g\f } f3) of having observed the data g if the true value was / 
and the variance of the noise was a = -K=: 




P{g\f,P) = \ , -e-™-». (6) 



Notice that the right hand side of eq. (6) is the same of eq. (5), but their 
meaning is different. We can now compute the joint probability P(g } /3\f) 
of having observed the data g in presence of gaussian noise with variance 
a = -^=, if the true value was f: 

P{g,P\f) = P{9\f,P)P{P)- (7) 

Since we are not interested in estimating /3, but we are interested only 
in the probability of g given /, that is our likelihood function, we integrate 
equation (7) over f3 to obtain the effective noise distribution 



roo 1 poo 9 i — 

P*(g\f) = / P(g\f, (3) P((3)d(3 = -= / e"^-') 03 P(J3) d(3 (8) 
Jo y 7T Jo v 

The MAP estimate is now obtained by maximizing the probability of eq. 
(8), or, taking the negative of its logarithm, solving the following minimiza- 
tion problem: 

min V(g - f) (9) 

where we have defined the so called effective potential 1 

V(x) = -\n e~ (ix J/3 P((3)d(3 . (fO) 



/o 

In the case in which n observations </i, . . . , g n have been taken the same 
considerations apply, and assuming that the variances <7 8 - of the measurements 
gi have all the same probability distribution, we obtain, instead of eq. (9): 

N 

min $>(<fc- - /) . (II) 

1 8 = 1 

This equation coincides with eq. (3), that has been proposed has a tech- 
nique to "robustize" the least-square estimate (2). In our case, however, 
the effective potential V derives from specific assumptions on how data are 
corrupted by noise. If the distribution of the random variable (3 is a delta 
function centered on some value /3, that is if P(/3) = S(f3 — f3) } the noise model 
is Gaussian with fixed variance, and the effective potential is a quadratic func- 
tion, yielding the same result of eq. (2). For other probability distributions 
P(/3), formula (10) allows to compute the corresponding effective potential 
by simply performing a one dimensional integration. Conversely, in some 
cases, given an effective potential V(x), it is also possible to understand if 
there is any probability distribution P(/3) that corresponds to it. In the next 
section we introduce a class of effective noise distributions for which such a 
characterization can be given. 

3 A class of effective noise distributions 

In this section we study and characterize a class of effective noise distri- 
butions. Since we want to maximize the effective noise distribution (8) we 



^^This name was previously introduced by Geiger and Girosi (1991), that used a similar 
technique applied at the problem of surface reconstruction with discontinuities. 



are not interested in effective distributions that are unbounded. It will turn 
out that if an effective noise distribution is bounded at the origin it is also 
bounded on all the real axis. Therefore, according to eq. (8) we define the 
bounded effective noise distributions as the probability distributions of the 
form: 

/(*) = jT 'e-^VWM/? (12) 

where P(j3) is a probability distribution, and such that the following condi- 
tion is satisfied: 

/(0) = jH yfpP(PW < +^ . 
We can now prove the following proposition: 

Proposition 3.1 A probability distribution f(x) is a bounded effective noise 
distributions if and only if f(y^x) is completely monotone. 



Proof: (only if) Suppose f(x) is a bounded effective noise distribution. 
Then f(y /f x) it can be represented as 



oo 

fix, 



f(^) = / e- p *d(i(P) 
Jo 



where 



Jo 
Since fi(/3) is clearly non decreasing and bounded, then by the Bern- 
stein's theorem on the representation of completely monotone functions (see 
Appendix A), f(y^x) is completely monotone. 



(if) Suppose that the probability distribution f(x) is such that f(y /r x) is 
completely monotone. Then it can be represented as 

f(x) = / e-P* dp(P) , (13) 

Jo 

with fi(/3) non decreasing and bounded. Since / is a probability distribution 
its integral over the real axis has unit value, and therefore 

+ 00 /"OO / /"OO 



f( x ) = 2 r ( r e- f3x2 dfi(i3)] dx 



Exchanging the order of integration and evaluating the gaussian integral, 
we obtain that 

^ KIJ c, 0<c<+oo. 



/o y^ 
Therefore it is always possible to write 



dfi(P) = P({3) yJ/3 d/3 , 

where P(P) is a probability distribution, being positive and having finite 
integral. Substituting this expression in formula (13) we obtain the rep- 
resentation of eq. (12). Noticing that completely monotone functions are 
bounded at the origin, since 



/(0) = / dfi(P) < +^ , 
Jo 

we conclude that f(x) is a bounded effective noise distribution. □ 

We can now answer to the question if effective potentials of the type V(x) = 
\x\ p can be derived in this framework. In fact, using the previous proposition 
it is sufficient to check if the probability distribution P(x) = e~' x ' P is such 
that P(y / i : ) is completely monotone. Using the fact that the function e~' x ' P 
is completely monotone if and only if < p < 1 (Schoenberg, 1937) (see 
appendix A), we can immediately derive the following proposition: 

Proposition 3.2 The function V(x) = \x\ p is the effective potential asso- 
ciated to a bounded effective noise distribution if and only if < p < 2. 



We notice that if we set p = 1 in the proposition above we obtain as 
effective potential the usual T\ error measure, that is V(x) = \x\, is obtained. 
However, since the function absolute value is not differentiable at the origin 
it has been proposed to use functions that behave quadratically in a neighbor 
of the origin, and linearly for large values of the argument (Eubank, 1988). 
Effective potentials of the form V(x) = \x\ p are interesting, since they are 
convex and the problem of maximizing the likelihood function has therefore 
only one solution. However, before showing what are the effective noise 
distributions that are associated to this effective potentials, we present a 
more simple example, that gives a non convex effective potential that has 
also been used in practice. 



4 A class of non convex effective potentials 

We have already seen that if the distribution P(P) is a delta function the 
standard quadratic potential is obtained. The simplest non trivial case con- 
sists in assuming that P(P) is a sum of two delta functions, that is 

P(P) = (1 - e)8(/3 - ft) + e8(P - ft) (14) 

where e is a parameter between and 1 and ft, ft are fixed positive numbers, 
ft > ft. If ft is a very small number such a distribution can represent the a 
priori knowledge that a fraction e of the data is very unreliable. In the limit 
of ft going to zero this fraction of data is constituted by genuine outliers, 
and we therefore analyze the model keeping in mind that we are interested 
in this limit. 

With the noise distribution given by eq. (14) the effective potential be- 
comes 

V{x) = -\n J Pe~ fjx [(1 - e)6(P - ft) + eS(P - ft)] J/3 (15) 



o 



and, after some algebra: 




V(x) = p i x 2 -ln[l + \-T e '] > ( 16 ) 



where we have neglected unimportant constant terms. 

We start studying the behavior of the potential in a neighbor of the origin. 
Taking a Taylor's expansion up to the second order, after some algebra we 
find that 

V(x) = V{0) + p 2 x 2 + o(x 3 ) 

so that the potential is initially quadratic, and very flat if ft is small, that 
is if we assume that outliers are present in the data. 

When x goes to infinity the exponential term in the logarithm of eq. (16) 
grows very fast, (remember that ft > ft), and the unit term can be omitted, 
leading to major simplifications. This is true only if ft is "not to small", in 
the sense that the following inequality has to verified: 

x 2 »k(e,p i )-UnP 2 (17) 

where &(e, ft) is a constant that depends only e and ft, whose exact form is 
irrelevant to us. In the region where this condition is satisfied we therefore 
obtain: 
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beta2=0.01 



beta2=0 .05 



beta3=0. 10 



Figure 2: The non convex effective potential of the model above for different 
values of ft. 



V(x) ^ft:c 2 -ln&(e,ft) 



and therefore: 



2 In ft 



x 2 (^ - ft) 



V(x) ^ft:c 2 -ln&(e,ft) 



5 loft 



For large values of x and small values of ft, where "large" and "small" 
have to be intended in the sense of condition (17), the effective potential is 
again quadratic and very flat. 

In summary: for small values of x the potential is a very flat parabola, for 
large values of x is the same parabola, but translated of a positive amount 
that grows logarithmically with ft, and in between, since its hrst derivative 
is strictly positive, it smoothly connects these two behaviors. In fig. (2) we 
show the shape of the effective potential for fixed e and ft, for three differ- 
ent values of ft. We set e = 0.1, ft = 4, and ft G {0.1, 0.05, 0.01}. This 
amounts to say that we know a priori that 90% of the data points are affected 
by Gaussian noise of variance equal to 0.5 (that is J\)- The other 10% is af- 
fected by Gaussian noise with very large variance, that is a = 3.16, 4.47, 10. 
We notice that for a value of ft = 0.01, that corresponds to a variance 



(T = 10, the effective potential is extremely flat, almost constant. A similar 
behavior is expected: in fact it means that when the interpolation error is 
larger than a threshold its influence on the solution is not taken in account 
anymore, and this is exactly the kind of motivation that led statisticians to 
consider robust models. 



5 A class of convex effective potentials 

We now consider an effective potential of the form 

V(x) = V« 2 + x 2 . (18) 

where a is some given parameter, possibly zero. Functions of this type are 
well known in approximation theory by the name of "multiquadrics" , or 
"Hardy's multiquadrics", and their behavior is shown in fig. (3). Potentials 
of this shape are interesting because they are convex, so that the minimiza- 
tion problem associated with them has a unique solution. Moreover, poten- 
tials with a shape very similar to this one can be implemented in analog 
VLSI circuits (Harris, 1990), allowing very fast ways to solve the estimation 
problems. 

We are interested in finding the probability distribution that leads to this 
form of effective potential. A solution to this problem certainly exists, since 
it is easy to show that e~ v ^^' is completely monotonic. Therefore we have 
to find a function P(/3) such that 



e~ Va2+x2 = / e- x2(5 ^{p)P{p) d/3, 

This is in essence the problem of computing an inverse Laplace transform. 
We start from the following identity (Gradshteyn and Ryzhik, 1981): 



2y^e-^ = p-ie'-pe-* 13 d/3 (19) 

Jo 

and perform the substitution x — > x 2 + a 2 , obtaining 

2 v ^e- VxTT ^ = /°°/rf e -^ e -^ 2 -^ 2 d/3 . (20) 



/o 

Making the proper identifications in equation above, and paying attention 
to normalization factors, we obtain as a result: 

P{/3) = i- e -i?-^ 2 +« . (21) 



X 

> 




Figure 3: The multiquadric effective potential V(x) 
different values of the parameter a. 
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Figure 4: The distribution of variance P(cr) associated to the multiquadric 
effective potential. 



We can now derive the distribution P(&) of the variance a = -j=, impos 



ing 



P(f3)df3 = P(<j)d<j , (3 
After some algebra we obtain the function 



m\ 



G c 



P[ 



(J 



2 2 

T 2" 



Zae 4 ' 



whose shape is depicted in fig, (4) for three different values of a. We no- 
tice that when a increases the distribution becomes more peaked, and also 
flatter around the origin. Therefore the probability of having low-noise data 
decreases when a increases. Equivalently, we can also say that the proba- 
bility of having data with noise larger than a given threshold increases with 



a. 



6 Conclusions 

We have shown that it is possible to give a simple interpretation to estimators 
based on the solution of the minimization problem 



N 



m j n J2 V (& ~ f) 



(23) 



i=i 



ff 



where V is an appropriate function, that we call effective potential. If the 
function e~ v ^^' is completely monotone, using these robust estimators cor- 
responds to compute Maximum Likelihood estimators under the assumption 
that data are corrupted by Gaussian noise whose variance fluctuates accord- 
ing to a given probability distribution, that uniquely determines V . Typical 
"effective potentials" V, that have been used in the past, belongs to the class 
we consider. 

We notice that the result we derived holds also in the more general settings 
context of parametric and non parametric regression. In order to see why, 
let g = {(x 8 ,j/ 8 ) G R n X RJiLi be a set of data that has been obtained 
by randomly sampling a multivariate function / in presence of noise. In 
parametric regression we assume that / is a parametric function /j(x;p), 
where p G i? m , and the optimal set of parameters p is usually recovered by 
minimizing the least square error 

n 

min$>«-Mx,-;p)) 2 . (24) 

pfc 1=1 

As in the case considered in this paper, this can be thought as a maximum 
likelihood estimator, under the assumption of Gaussian noise with fixed vari- 
ance. Therefore the same argument we applied in section (2) applies here, 
and more robust estimates could be obtained if we replace the quadratic 
function in eq. (24) with an effective potential V . 

In non parametric regression no assumption is made on the specific form 
of /, and a common technique consists in solving the following minimization 
problem: 

N 

min £(/(x,-) - yif + XS[f] . 

' 8 = 1 

where S[f] is an appropriate convex functional of / and A a positive number. 
This correspond to compute the Maximum A Posteriori estimator, under 
the assumption of Gaussian noise and a priori probability for / given by 
P(f) oc e~ A5 ^. If we assume that the variance of the Gaussian noise is a 
random variable, using the same argument we used in section (2) we can prove 
that the Maximum A Posteriori estimator solves the following minimization 
problem: 

N 

min J2V(f(^)- yi ) + XS[f]. (25) 

' 8 = 1 
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where V is an effective potential. Estimators of this type are known in the 
statistical literature as M-type smoothing splines (Eubank, 1988), and their 
implementation in analog VLSI circuits has been considered by J. Harris 
(1991) for some choices of the functional S[f]. 

A Completely Monotone Functions 

We need to give the following: 

Definition A.l A function f is said to be completely monotonic on (0,oo) 
provided that it is C°°(0,oo) and (-l)'0(x) > 0, Vi G (0, oo), V7 G j\f , 
where J\f is the set of natural numbers. 

A typical example of completely monotone function is the exponential 
function f(x) = e~ ax } with a > 0. It turns out that all the completely mono- 
tone functions are linear superpositions with positive coefficients of scaled 
exponentials, as the following theorem of Bernstein shows: 

Theorem A.l (Bernstein, 1929) The class of completely monotone func- 
tions is identical with the class of functions of the form 

g(x) = / e"^ dfi(/3), 
Jo 



where fi(/3) is non-decreasing and bounded for f3 > 0. 
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